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Abstract 

Task parallelism as employed by the OpenMP task construct or 
some Intel Threading Building Blocks (TBB) components, al- 
though ideal for tackling irregular problems or typical produc- 
er/consumer schemes, bears some potential for performance 
bottlenecks if locality of data access is important, which is 
typically the case for memory-bound code on ccNUMA sys- 
tems. We present a thin software layer ameliorates adverse ef- 
fects of dynamic task distribution by sorting tasks into locality 
queues, each of which is preferably processed by threads that 
belong to the same locality domain. Dynamic scheduling is 
fully preserved inside each domain, and is preferred over pos- 
sible load imbalance even if nonlocal access is required, mak- 
ing this strategy well-suited for typical multicore-mutisocket 
systems. The effectiveness of the approach is demonstrated by 
using a blocked six-point stencil solver as a toy model. 

1 Introduction 

1.1 Dynamic scheduling on ccNUMA systems 

"Cache-coherent nonuniform memory access" (ccNUMA) is 
the preferred system architecture for multisocket shared- 
memory servers today. In ccNUMA, main memory is logically 
shared, meaning that all memory locations can be accessed by 
all sockets and cores in the system transparently. However, 
since main memory is physically distributed, i.e., partitioned in 
so-called locality domains (LDs), access bandwidths and laten- 
cies may vary, depending on which core accesses a certain part 
of memory. Access is fastest from the cores directly attached 
to a domain. Nonlocal accesses are mediated by some inter- 
domain network, which is also capable of maintaining cache 
coherency throughout the system. 

The big advantage of ccNUMA is that the available main 
memory bandwidth scales with the number of LDs, and shared- 
memory nodes with hundreds of domains can be built. Many 
applications in science and engineering rely on large memory 
bandwidth; computational fluid dynamics (CFD) and sparse 



matrix eigenvalue solvers are typical examples. However, 
applications using shared-memory programming models like, 
e.g., OpenMP HI, TBB 0, or POSIX threads, should make 
sure that locality of access is maintained. Massive perfor- 
mance breakdowns may be observed when nonlocal (inter-LD) 
accesses or contention on an LD's memory bus become bot- 
tlenecks [ 3 ] . One should add that the current OpenMP stan- 
dard, although it is the dominant threading model for scientific 
user codes, does not contain any features that would enable 
ccNUMA access optimizations. 

Most operating systems support a first touch ccNUMA 
placement policy: After allocation (using, e.g., mallocO), 
the mapping of logical to physical memory addresses is not 
established yet; the first write access to an allocated memory 
page will map the page into the locality domain of the core 
that executed the write. This makes it straightforward to opti- 
mize parallel memory access in applications that have regular 
memory access patterns. If the loop(s) that initialize array data 
are parallelized in exactly the same way and use the same ac- 
cess patterns as the loops that use the data later, nonlocal data 
transfer can be minimized. A prerequisite for first touch initial- 
ization to work reliably is that threads are not allowed to move 
freely through the shared-memory machine but maintain their 
affinity to the core they were initially bound to. Some thread- 
ing models discourage the use of strong thread-core affinity, 
but numerically intensive high-performance parallel applica- 
tions usually benefit from it. Operating systems often provide 
libraries and tools to enable a more fine-grained control over 
page placement. Under Linux, the numactl command and the 
libmuna library are part of every standard distribution. 

Unfortunately the "first touch" scheme does not work in all 
cases. Sometimes memory access cannot be organized in con- 
tiguous data streams or, even if that is possible, the problem 
itself may be irregular and show strong load imbalance if a 
simple static work distribution is chosen. Dynamic scheduling 
is the general method for handling the latter case. The OpenMP 
standard [1| provides the dynamic and guided scheduling 
types for worksharing loops, and the task construct for task- 
based parallelism. In Intel Threading Building Blocks (TBB) 
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Table 1: Overview of the ccNUMA systems in the test bed. "NT" denotes that nontemporal stores were used in the STREAM 
benchmark as well as for the Jacobi solver test application. 
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Figure 1: Topology of the AMD Istanbul test system with four 
locality domains. The Intel Nehalem EX system is very similar 
but has eight instead of six cores per socket, and each socket 
has direct QPI connections to all other sockets. 
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Figure 2: Topology of the Intel Nehalem EP test system with 
two locality domains. 
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[2 1, a task is the central scheduling entity as well, and distribu- 
tion of tasks across threads is fully dynamic. If the additional 
overhead for dynamic scheduling is negligible for the appli- 
cation at hand, these approaches are ideal on UMA (Uniform 
Memory Access) systems like the now outdated single-core 
multi-socket SMP nodes, or multi-core chips with "isotropic" 
caches, i.e., where each cache level is either exclusive to one 
core or shared among all cores on a chip. On ccNUMA sys- 
tems, however, dynamic scheduling leads to nonlocal mem- 
ory accesses and contention on the LD's memory buses. The 
simplest option to choose is then to distribute memory pages 
across locality domains in a cyclic fashion using, e.g., the 
above-mentioned NUMA tools, which will lead to at least a 
certain degree of parallel memory access. Under Linux, one 
may write: 

> env MP _NUM_ THREADS =8 numactl -i 0-3 ./a. out 

This will start the (OpenMP) binary with eight threads and 
make sure that memory pages are mapped cyclically across 
four LDs (0-3). Initialization inside the program is then in- 
significant for the placement unless special libraries are used, 
so it may be done sequentially just as well. 

The purpose of this work is to demonstrate that a sim- 
ple user-level software layer can make close to optimal cc- 
NUMA page placement possible even with dynamic schedul- 
ing or tasking, by sorting tasks upon initialization into a num- 
ber of locality queues. We will show that our scheme works for 
OpenMP tasking and parallel TBB constructs, and compare it 
to the "affinity partitioner" in TBB Q, which has a similar 
purpose. Contrary to the assumption that tasking causes "ran- 
dom" page access, the order in which tasks are submitted to 
the execution thread pool can have a noticeable impact on per- 
formance. 

1.2 Related work 

Using the default first-touch policy with parallel initialization 
is a simple optimization technique for memory-bound shared- 
memory parallel code, but ccNUMA awareness is unfortu- 
nately not yet well established among application programmers 
in science and engineering. Moreover, although introducing 
multiple execution queues with a work-stealing scheme on top 
is not new, the possibilities for enhancing ccNUMA access lo- 
cality under dynamic task scheduling with user code only and 
within the capabilities of current compilers and OS environ- 
ments have not been explored in great detail. Most work con- 
centrates on low-level thread scheduling techniques for various 
threading models (mostly OpenMP and Cilk), either runtime- 
based |5 6 7|, OS-based |H|. or even hardware-based (5|. Au- 
tomatic page migration 1 10 1 can enhance locality significantly, 
but is again not generally applicable and must necessarily em- 
ploy, to varying extent, heuristic methods to decide about page 
placement. 



The method proposed here consists of a thin software layer 
that effectively modifies the task scheduling algorithm em- 
ployed by the compiler and runtime system, based on locality 
information that can either be supplied by the user or obtained 
automatically, depending on the situation. 

1.3 Test bed for performance measurements 

We have chosen three ccNUMA-type systems for perform- 
ing benchmarks (see Table [TJ. The six-core AMD "Istan- 
bul" (see Fig. [2]) and quad-core Intel "Nehalem EP" proces- 
sors (see Fig. \lj have been on the market for some time; the 
eight-core "Nehalem EX", however, has been introduced only 
recently. Our early-access Nehalem EX benchmark system 
was equipped with only half the maximum number of memory 
boards per socket, which leads to a reduction of the effective 
main memory bandwidth by a factor of two. Although of minor 
importance for the results presented here, this is of course not 
a desirable configuration for a production system. All systems 
ran current Linux kernels. The Intel C++ compiler in version 
11.1.064 and TBB version 3.0 (open source variant) were used 
for the benchmarks. 

All three systems have a similar maximum bandwidth as 
measured by the STREAM copy benchmark [ 1 1 1, which mod- 
els closely the memory access behavior of the Jacobi solver. 
Nontemporal stores ("NT") were used if appropriate; NT stores 
bypass the cache hierarchy and can improve store bandwidth 
by avoiding the write-allocate cache line transfer on store 
misses. 

1.4 Benchmarking procedure and baseline per- 
formance 

As a simple benchmark we choose a 3D six-point Jacobi solver 
with constant coefficients as recently studied extensively by 
Datta et al. 0. The site update function, 

F t+1 (iJ,k) = c-[F t (i-lJ,k)+F t {i+l,j,k) 
+ F t (i,j-l,k)+F t (i,j+l,k) 
+ F,(i,j,k-l)+F t (i,j,k + l)] , 

is evaluated for each lattice site in a 3D loop nest. Each site 
update (in the following called "LUP") incurs six loads and 
one store, of which, at large problem sizes, one load and one 
store cause main memory traffic if suitable spatial blocking is 
applied. This leads to a code balance of 8/3 bytes per flop (as- 
suming that nontemporal stores are used so that a store miss 
does not cause a cache line write-allocate transfer), so the code 
is clearly memory-bound on all current cache-based architec- 
tures. In what follows we use a problem size of 600 2 x 2400 
sites (« 13 GB of memory for both grids and double precision 
variables) and a blocksize of 600x lOx 100 (dkXdjxdi, with k 
being the inner [fast] index) sites, unless otherwise noted. This 
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Figure 3: Performance (median over 100 samples) of all code versions on the systems in the test bed (all cores utilized). At 
each bar, horizontal lines mark the full-node performance under standard OpenMP static worksharing with serial initialization 
in LDO (bottom), with round-robin page placement via numactl (middle), and with correct parallel first-touch placement (top). 
The labels below the columns denote static vs. static, 1 scheduling for the OpenMP initialization loop ("s" vs. "s-1"), different 
task submission orders for OpenMP tasking ("ijk" vs. "kji"), pinned vs. nonpinned TBB threads ("p" vs. "n-p"), and the use or 
omission of the affinity partitioner with TBB ("a" vs. "n-a"). 
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is close to the optimal block dimensions on all architectures 
considered here. In a standard OpenMP -parallel implementa- 
tion, the update loop nest iterates over all blocks in turn, and 
standard worksharing parallelization is done over the three col- 
lapsed blocking loops (first-touch initialization is performed 
via the identical scheme): 

#pragma omp parallel for \ 

collapse (3) s chedule ( runt ime ) 
for(int ib=0; ib<no_of _i_blocks ; ++ib) { 
for(int jb=0; jb <no_of _ j _blo cks ; ++jb) { 
fortint kb=0; kb < no_of _k_blocks ; ++kb) { 
jacobi_sweep_block(ib,jb,kb); 

} } } 

Note that with the standard k blocksize being equal to the 
extent of the lattice in that direction (which is required 
to make best use of the hardware prefetching capabilities 
on the processors used), no_of _k_blocks is equal to one. 
The jacobi_sweep_block() function performs one Jacobi 
sweep, i.e., one update per lattice site, over all sites in the 
block determined by its parameters. In case of dynamic loop 
scheduling there is a choice as to how parallel first-touch ini- 
tialization should be done; both static, 1 (round robin) and 
plain static scheduling will be investigated. 

Note that this simple benchmark is not a typical applica- 
tion scenario for tasking, since the load is evenly distributed 
and parallelization with standard OpenMP loop worksharing 
constructs is straightforward. However, it provides a well- 
controlled environment for showing the effects of dynamic 
scheduling and the limitations of runtime systems. Moreover, 
even applications with very regular access patterns can bene- 
fit from task-based parallelism, because functional decompo- 
sition into "communicating" and "computing" tasks is greatly 
simplified. This has been demonstrated recently in the con- 
text of a 3D particle-in-cell code 1121 . When using a thread- 
ing model together with message passing (MPI) in hybrid 
shared/distributed-memory programming it is also vital to re- 
duce per-node performance variations, since those will limit 
scalability of the whole application. We will briefly comment 
on this problem below. 

For OpenMP we enforced strict thread-core affinity in all 
benchmark runs by using the Linux sched_setaf f inity () 
function. In production environments, more user-friendly tools 
like hwloc 1131 or likwid-pin lfl4ll are certainly preferable. In 
TBB, the concept of a "thread" or its affinity to a piece of 
hardware is not made explicit for the programmer; a simple 
parallel_f or loop with the number of iterations equal to 
the number of spawned threads is repeated until each thread 
was assigned a "dummy" task for the sole purpose of calling 
sched_setaff inity () and establishing a fixed thread-core 
mapping. 

Impact of suboptimal page placement The horizontal lines 
in all panels of Fig. [3] illustrate the impact of suboptimal page 
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Figure 4: Performance variability with OpenMP tasking (left) 
and the TBB parallel_f or construct (right). Median, ±25%, 
and ±45% quantiles are indicated (100 samples each). 



placement on the solver's performance. The lowest perfor- 
mance is consistently achieved with purely sequential initial- 
ization, i.e., with a serial initialization loop, and static work- 
sharing. In this limit, the memory interface of a single LD be- 
comes a bottleneck and the cores in all but this single domain 
have to access their data via the ccNUMA network. Round- 
robin placement as established, e.g., with the numactl tool, 
and boosts performance significantly by enabling at least some 
level of parallelism. Optimal bandwidth utilization is of course 
reached with static, parallel first-touch placement, and comes 
close to the STREAM copy numbers in Table [T] On a UMA 
system (or within a single ccNUMA domain), all three lines 
would match. The penalty for round-robin placement is es- 
pecially large for the Nehalem EP system, since it has the 
strongest "NUMA effect" (bandwidth reduction for nonlocal 
access). On the other hand, the performance level for se- 
quential placement is particularly low on Nehalem EX, which 
can be attributed to the fact that our EA system is extremely 
bandwidth-starved due to the lack of half the memory boards 
per LD. 

Note that the impact of scheduling overhead is not investi- 
gated here. If the amount of work per task is small, dynamic 
scheduling can potentially be hazardous for performance Q. 
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2 Tasking with OpenMP 
2.1 Baseline 

In contrast to standard worksharing loop parallelization, task- 
ing in OpenMP requires to split the problem into a number of 
work "packages", called tasks, each of which must be submit- 
ted to an internal pool via the omp task directive. For the 
Jacobi solver we define one task to be a single block of the 
size specified above. This is in contrast to standard static loop 
worksharing, where one parallelized loop iteration consisted of 
several blocks with different coordinates. 

The tasks are produced ("submitted") by a single thread and 
consumed by all threads and in a 3D loop nest: 

#pragma omp parallel 
{ 

#pragma omp single 

{ 

for(int ib=0; ib <no_of _ i _blo cks ; ++ib) { 
for(int jb=0; jb < no_of _ j _bl ocks ; ++jb) { 
forCint kb=0; kb <no _of _k_blocks ; ++kb) { 
#pragma omp task 

jacobi_sweep_block(ib,jb,kb) ; 

} } } 

} 

} 

Submitting the tasks in parallel is possible but did not make 
any difference in the parameter ranges considered here. This 
parallel block is actually a "worksharing" construct, since all 
threads that are waiting in the implicit barrier at the end of the 
omp single construct execute tasks that have been submitted 
by the one thread that entered the single region. After finish- 
ing the submit loop nest, this thread will join the others. 

In contrast to the code above, which submits tasks in jb di- 
rection first ("ijk"; the single block in kb direction does not 
count), the loop nest order can be reversed ("kji"), leading to 
a functionally equivalent code. There is also a choice as to 
how first-touch initialization should be performed, so we com- 
pare static and static , 1 scheduling ("s" vs. "s-1") for loop 
initialization. The left column of panels in Fig. [3] shows per- 
formance results on all platforms. The four combinations of 
ijk/kji submit order with static/static, 1 initialization are indi- 
cated below the graph. In general, this code is never faster 
than standard static worksharing with round-robin placement. 
Combining static initialization with ijk submit order seems to 
be especially unfortunate. 

The large impact of submit and initialization orders can be 
explained by assuming that there is only a limited number of 
"queued", i.e., unprocessed tasks allowed at any time. In the 
course of executing the submission loop, this limit is reached 
very quickly and the submitting thread is used for processing 
tasks for some time. From our measurements, the limit is set to 
roughly 256 tasks with the compiler used (current GNU com- 
pilers have the same limit). One ib-jb layer of the grid com- 
prises 60 tasks (with the chosen problem and block sizes), and 



240 layers are available, which amounts to 14400 tasks in to- 
tal. With static scheduling on initialization, one block of 256 
consecutive tasks is usually associated with a single locality 
domain (rarely two), hence the serialization of memory access. 
Choosing static , 1 scheduling for initialization, each row of 
t consecutive blocks (t being the number of threads per socket) 
is placed into a different locality domain, but 256 tasks com- 
prise only slightly more than four layers. Assuming that the 
order of execution for tasks resembles static, 1 loop work- 
share scheduling because each thread is served a task in turn, 
the number of LDs to be accessed in parallel is limited (al- 
though it is hard to predict the actual level of parallelism, since 
it is also influenced by the number of threads per LD). Finally, 
by choosing the kji submission loop order, consecutive tasks 
cycle through locality domains, and parallelism is as expected 
from dynamic loop scheduling. In all cases, performance vari- 
ability is surprisingly small (see left panel in Fig. |4]i. 

These observations document that it is nontrivial to employ 
tasking on ccNUMA systems and reach at least the perfor- 
mance level of standard dynamic loop scheduling or round- 
robin page placement. In the next section we will demon- 
strate how task scheduling under locality constraints can be 
optimized by "overriding" part of the OpenMP task schedul- 
ing by user program logic. 

2.2 OpenMP tasking with locality queues 

Each task, which equals one lattice block (or tile) in our case, 
is associated with a C++ object (of type block) and equipped 
with an integer locality variable. This variable denotes the lo- 
cality domain the block was placed in upon initialization. The 
submission loop now takes the following form: 

#pragma omp parallel 
{ 

#pragma omp single 

{ 

for(int ib=0; ib <no_ of _ i _blo cks ; ++ib) { 
for(int jb=0; jb < no_of _ j _blocks ; ++jb) { 
for(int kb=0; kb <no _of _k_blocks ; ++kb) { 
block *b = blocks [ib] [jb] [kb] ; 
queues [b->localityO] . enqueue(b) ; 
#pragma omp task 

process_block_f rom_queue (queues) ; 

} } } 

} 

} 

The queues object is a std: :vector<> of std: :queue<> 
objects, each associated with one locality domain, and each 
protected from concurrent access via an OpenMP lock. Calling 
the enqueue ( ) method of a queue appends a block object to it. 
As shown above, blocks are sorted into those locality queues 
according to their respective locality variables. One OpenMP 
task, executed by the process_block_f rom_queue ( ) func- 
tion, now consists of two parts: 

1 . Figuring out which LD the executing thread belongs to 
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2. Dequeuing the oldest waiting block in the local- 
ity queue belonging to this domain and calling 
jacobi_sweep_block() for it 



If the local queue of a thread is empty, other queues are tried 
in a spin loop until a block is found ("work stealing"): 

void process_block_f rom_queue ( locality_queues \ 

fequeues) { 

// . . . 

bool f ound = f als e ; 
block *b; 

int Id = ld_ID [omp_get_thread_num () ] ; 
while ( ! f ound ) { 

found = queues [Id] . dequeue (p) ; 

if (! found) { 

Id = (Id + 1) queue s . s ize () ; 

} 

} 

j acobi_sweep_block (b->ib , b->jb , b->kb); 

} 

The global ld_ID vector must be preset with a correct mapping 
of thread numbers to locality domains. It is possible with the 
described scheme that some task executes a block just queued 
before the corresponding task is actually submitted. This is 
however not a problem because the number of submitted tasks 
is always equal to the number of queued blocks, and no task 
will ever be left waiting for new blocks forever. 

Note that scanning other queues if a thread's local queue is 
empty gives load balancing priority over strict access locality, 
which may or may not be desirable depending on the applica- 
tion. The team of threads in one locality domain shares one 
queue, so scheduling is still purely dynamic inside an LD. 

The second column of panels in Fig. [3] shows performance 
results: For static initialization and the ijk submission order, 
the limited overall number of waiting tasks has the same con- 
sequences as with plain tasking (see Sect. |2.1| i. In this case, 
although the queuing mechanism is in effect, a single queue 
holds most of the tasks at any point in time. All threads are 
served from this queue and thus mostly execute in a single LD. 
However, using the alternate kji submission order or static , 1 
initialization, all queues are fed in parallel and threads can al- 
ways be served tasks from their local queue. Performance then 
comes close to static scheduling within a 10 % margin. 

One should note that a similar effect could have been 
achieved with nested parallelism, using one thread per LD in 
the outer parallel region and several threads (one per core) in 
the nested region. However, we believe our approach to be 
more powerful and easier to apply if properly wrapped into 
C++ logic that takes care of affinity and work distribution. 
Moreover, the thread pooling strategies employed by many cur- 
rent compilers inhibit sensible affinity mechanisms when using 
nested OpenMP constructs. 



3 Tasking with TBB 

3.1 Baseline and affinity partitioner 

The universal TBB construct for task-parallel execution is 
the parallel_f or function. Initializing all blocks by "first 
touch" and performing a domain sweep looks as follows: 

tbb: :parallel_for( 

tbb: :blocked_range_3d<int>( 

0, no_of _i_blocks , 1, 

0, no_of _ j _blocks , 1, 

0, no_of _k_blocks , 1), 
touch_bl o ck ( blocks ) ); 

tbb: :parallel_for( 

tbb: :blocked_range_3d<int>( 

0, no_of _i_blocks , 1, 

0, no_of _ j _blocks , 1, 

0, no_of _k_blocks , 1), 
update_block (blocks ) ); 

The tbb: :blocked_range_3d<> object encodes the way the 
three-dimensional domain (of blocks) is cut into subdomains. 
Here we have specified that the smallest unit in each coordinate 
direction is a single block. In TBB the user must provide a 
C++ class that implements operator () (i.e., a functor), which 
takes a reference to the range object and performs the actual 
"work": 

class update_block 
{ 

blocks & m_blocks ; 

publ i c : 

update_block (blocks & b) 
: m_blocks (b) {} 

void operator () (tbb : : blocked_range_3d<int> 

& subrange) { 

// ... iteration loop nest 

// over subrange -> bi , bj , bk 

j acobi_sweep_block ( ib , jb, kb) ; 
// ... end iteration loop nest 

} 

// . . . 

}; 

The subrange parameter to the functor may encode a single 
block or a consecutive range of blocks along all coordinates; 
this is a decision made at runtime by TBB. 

The third column of panels in Fig.|3]shows performance re- 
sults for TBB with the scheme just described, comparing the 
situation with and without binding threads to cores ("p" vs. 
"n-p") and without using the affinity partitioner ("n-a", see be- 
low). Since first-touch placement is done via a parallel_f or 
loop, page mapping is dynamic and performance is close to the 
round-robin placement case with standard OpenMP workshar- 
ing, as expected. The mediocre results on the Istanbul system 
are surprising; it is as yet unclear why TBB should perform 
worse than OpenMP with our locality optimizations employed. 
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TBB provides a user-friendly way to specify that 
affinity information is important for performance. 
The tbb: :parallel_f or function takes an op- 
tional "partitioner" argument, which can be set to 
tbb : : af f inity_partitioner. In this case TBB stores 
information about thread-task affinity in an internal data struc- 
ture on the first call to tbb: :parallel_f or. On subsequent 
parallel loops, the scheduler tries to map tasks to the same 
threads as before, thereby establishing access locality auto- 
matically. The affinity partitioner must thus be specified on 
both the initialization and update loops. The third column of 
panels shows performance results with this optimization ("a"), 
with and without binding threads to cores ("p" vs. "n-p"). 
Obviously the affinity partitioner can significantly improve 
locality of access and is able to match the performance of 
OpenMP tasking with locality queues. 

3.2 TBB tasking with locality queues 

It is possible to adapt the locality queue mechanism to TBB 
as well, by letting the update_block() functor enqueue the 
blocks in the assigned subrange into the appropriate local- 
ity queues, and updating the same number of blocks (prefer- 
ably) from the executing thread's local queue. Instead of 
std:queue<>, the tbb: : concurrent_queue<> container is 
used here since it provides automatic fine-grained locking. 
However, the performance benefit compared to the affinity 
partitioner is marginal (see the fourth column of panels in 
Fig . [3j . This can be attributed to the fact that submission order 
(as defined in the OpenMP tasking versions) cannot be con- 
trolled in this setting. Using a one-dimensional partitioner or 
a parallel_do construct could enable finer control over page 
placement, but the expected additional benefit is small. 

4 Summary and outlook 

We have demonstrated how locality queues can be employed 
to optimize parallel memory access on ccNUMA systems 
when OpenMP tasking or the TBB parallel_f or construct 
is used. Locality queues substitute the uncontrolled, dynamic 
task scheduling by a static and a dynamic part. The latter is 
mostly restricted to the cores in one NUMA domain, provid- 
ing full dynamic load balancing on the locality domain (LD) 
level. Scheduling between domains is static, but load balanc- 
ing is given priority over strictly local access by a work steal- 
ing scheme. The larger the number of threads per LD, the more 
dynamic the task distribution, so our scheme will get more in- 
teresting in view of future many-core processors. Using lo- 
cality queues with TBB's parallel_f or construct does not 
outperform the built-in affinity partitioner, but the impact on 
parallel_do cannot be inferred from this result, and is yet to 
be investigated. Note that the concept would in principle work 



also without thread-core affinity because the current locality 
domain ID of a thread could be determined at any time, and 
the static mapping of threads to LDs would become obsolete. 

Future work encompasses the application of the concept 
to real application codes, notably sparse matrix eigenvalue 
solvers, where load balancing and overlapping computation 
with communication may be achieved in a natural way by task- 
ing. Further potentials, not restricted to ccNUMA architec- 
tures, may be found in the possibility to implement temporal 
blocking (doing more than one time step on a block to reduce 
pressure on the memory subsystem 1 15 1) by associating one lo- 
cality queue to a number of cores that share a cache level. As 
an advantage over static temporal blocking, no frequent global 
barriers would be required. 
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